Distinct

distinct() eliminates duplicate records (rows that match on all columns) from a DataFrame.

// in the spark-shell; in a standalone app, import spark.implicits._ first so toDF is available
val data = Seq(
  ("James", "Sales", 3000),
  ("Michael", "Sales", 4600),
  ("Robert", "Sales", 100),
  ("Maria", "Finance", 3000),
  ("James", "Sales", 3000),
  ("Scott", "Finance", 3300),
  ("Jen", "Finance", 3900),
  ("Jeff", "Marketing", 3000),
  ("Kumar", "Marketing", 2000),
  ("Saif", "Sales", 4100))
val df = data.toDF("employee_name", "department", "salary")
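
The Seq above contains one exact duplicate row ("James", "Sales", 3000), so calling distinct() on the whole DataFrame removes it. A quick check of the row counts:

df.count()            // 10 rows, including the duplicate
df.distinct().count() // 9 rows once the duplicate is removed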

Select department name from the DataFrame
df.select("department").show()

Select unique records from the DataFrame
We can use the distinct() function to remove the duplicate rows of a DataFrame and get back a DataFrame without them.

df.select($"department").distinct().show()


You can also use dropDuplicates() to get unique values.
With no arguments, the dropDuplicates() operation drops duplicate rows just like distinct().

df.select($"department").dropDuplicates().show()
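
Unlike distinct(), dropDuplicates can also deduplicate on a subset of columns while keeping the rest of the row. A minimal sketch; the column choice here is just for illustration:

// keep one row per (department, salary) pair; the other columns come from an arbitrary surviving row
df.dropDuplicates("department", "salary").show()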


Count the unique records in a DataFrame
#pyspark
Data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Michael", "Sales", 4600),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Kumar", "Marketing", 2000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=Data, schema=columns)

df.count()
Out[26]: 15

df.distinct().count()
Out[27]: 9
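
To count the distinct values of a single column rather than whole rows, Spark also provides countDistinct. A minimal Scala sketch, reusing the df from the Scala example at the top of the post:

import org.apache.spark.sql.functions.countDistinct

// three distinct departments in the sample data: Sales, Finance, Marketing
df.select(countDistinct("department")).show()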
